CIS 432 Homework 5

Simon Business School


CIS432 [33B]
PREDICTIVE ANALYTICS USING PYTHON
Prof Yaron Shaposhnik
Spring B
Homework 5
Name Email-Id Student No.
Neeraja Menon nmenon2_simon@simon.rochester.edu 31318631
Neha Jayachandran njayacha@simon.rochester.edu 32275841
Saptarishi Pandey spandey3@simon.rochester.edu 32371505

Introduction

In this assignment, we will develop a predictive model decision support system that evaluates the risk of Home Equity Line Of Credit applications (HELOC).

Based on this predictive model, we intend to design an interactive interface that sales representatives in a bank/credit card company can use to decide on whether to accept or reject an application.


About the dataset

To familiarize ourselves with the use case and the nuances of the dataset, we further researched on the dataset provided to us. Based on the FICO website, we found that credit scores are important to consider when financial institutions evaluate the risk on loans. The scores are designed to predict the likelihood of loan repayment. When a loan is rejected, regulators require the institution to inform the customers why their loan application is rejected in the first place. Customers demand explanations for their scores. If models are not interpretable, they are unlikely to be deployed in the real world as they do not meet the regulatory standards. Thus, one of our goals with our predictive model is to make it interpretable so the sales representative using the interface can give a reasonable explanation as to why the customer’s loan application is rejected.

HELOC

The HELOC allows property owners to take loans using the equity in their property as collateral. When a customer applies for a HELOC, the financial institution appraises their property and subtracts any mortgages. The remainder is the home equity which becomes the maximum amount that can be borrowed (credit limit). Because a home is often a consumer’s most valuable asset, many homeowners use HELOCs only for major items such as home improvement, medical bills, or education, unlike a credit card that is generally used for day-to-day expenses.
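The credit-limit arithmetic described above can be illustrated with a toy function (the function name and dollar figures are hypothetical, not from the dataset):

```python
def heloc_credit_limit(appraised_value, mortgage_balance):
    # Home equity = appraised value minus outstanding mortgages;
    # as described above, this becomes the maximum credit line.
    return max(appraised_value - mortgage_balance, 0)

# Hypothetical example: a $400,000 home with a $250,000 mortgage
print(heloc_credit_limit(400_000, 250_000))  # 150000
```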

The customers in this dataset have requested a credit line in the range of $5000 through $150,000. The objective of the model is to predict whether the customers will repay their HELOC account within 2 years. This prediction is then used to decide whether the homeowner qualifies for a HELOC, and if so, how much the credit should be extended.

The target variable to predict is a binary variable called RiskPerformance. Bad RiskPerformance means that the consumer was 90 days past due or worse at least once over a period of 20 months after the credit account was opened. Good RiskPerformance means that they made their payments without ever being more than 90 days overdue.

Variable Explanation
RiskPerformance Binary target
ExternalRiskEstimate Consolidated indicator of risk markers
MSinceOldestTradeOpen Number of months that have elapsed since the first trade
MSinceMostRecentTradeOpen Number of months that have elapsed since the last opened trade
AverageMInFile Average months in file
NumSatisfactoryTrades Number of satisfactory trades
NumTrades60Ever2DerogPubRec Number of trades which are more than 60 days past due
NumTrades90Ever2DerogPubRec Number of trades which are more than 90 days past due
PercentTradesNeverDelq Percent of trades that were never delinquent
MSinceMostRecentDelq Number of months that have elapsed since the last delinquent trade
MaxDelq2PublicRecLast12M The longest delinquency period in the last 12 months
MaxDelqEver The longest delinquency period
NumTotalTrades Total number of trades
NumTradesOpeninLast12M Number of trades opened in the last 12 months
PercentInstallTrades Percent of installment trades
MSinceMostRecentInqexcl7days Months since last inquiry (excluding the last 7 days)
NumInqLast6M Number of inquiries in the last 6 months
NumInqLast6Mexcl7days Number of inquiries in last 6 months (excluding the last 7 days)
NetFractionRevolvingBurden Revolving balance divided by credit limit
NetFractionInstallBurden Installment balance divided by original loan amount
NumRevolvingTradesWBalance Number of revolving trades with balance
NumInstallTradesWBalance Number of installment trades with balance
NumBank2NatlTradesWHighUtilization Number of trades with a high utilization ratio (the credit card balance relative to the credit limit)
PercentTradesWBalance Percent of trades with balance


Data exploration

Upon observing the data, we noticed some interesting trends. Clearly, there are special encodings for certain values: many features contain values such as -7, -8, or -9.

Let’s create the training and testing sets

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing, fixing the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=1234)

Preprocessing

Special Values

This case closely mirrors the one in Assignment 2: there are special encodings for missing data. We therefore created dedicated pipelines to deal with them.

From the data dictionary, we know that

Value Explanation
-9 No Bureau record or investigation
-8 No usable/valid trades or enquiries
-7 Condition not met (no inquiries/no delinquencies)

Since a value of -9 means that there was no bureau record or investigation, there is no information to score; we therefore designed our model not to make predictions for such records.
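A minimal sketch of this filtering, assuming the data sits in a pandas DataFrame with the target column `RiskPerformance` (the tiny demo frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "RiskPerformance": ["Good", "Bad", "Good"],
    "ExternalRiskEstimate": [75, -9, 62],
    "NumSatisfactoryTrades": [20, -9, 15],
})

# Keep only rows that carry real bureau information: when every
# feature is -9 there is no record to score, so we do not predict.
features = df.drop(columns=["RiskPerformance"])
df_clean = df[~(features == -9).all(axis=1)]
print(len(df_clean))  # 2
```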

The pairplot below shows in more detail how each feature relates to the others.

Special values in Maximum Delinquencies

We also noted that the columns related to max delinquencies are categorical variables and they have special meanings too.

MaxDelq2PublicRecLast12M

Value Explanation
0 Derogatory Comment
1 120+ days delinquent
2 90 days delinquent
3 60 days delinquent
4 30 days delinquent
5, 6 Unknown delinquency
7 Current and never delinquent
8, 9 All other

MaxDelqEver

Value Explanation
1 No such value
2 Derogatory comment
3 120+ days delinquent
4 90 days delinquent
5 60 days delinquent
6 30 days delinquent
7 Unknown delinquency
8 Current and never delinquent
9 All other

To deal with these special values, along with the special values of -7 and -8, we created transformers.

Before creating the pipeline, we used a do-nothing imputer to preserve informative missing values and a OneHotEncoder to convert categorical variables to a numerical representation. We also handled the special encodings by adding dummy features for all -7 and -8 values. With this in place, we proceeded to model training.

Creating pipelines

To preprocess the data, we created pipelines that handle the special values with dedicated encoding techniques, impute mean values, and convert the categorical variables.

Do nothing Imputer

This works the same way as the do nothing imputer did in Assignment 3.
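We do not reproduce the assignment's exact class here, but a do-nothing imputer can be sketched as a scikit-learn identity transformer along these lines (the class name is our illustrative choice):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DoNothingImputer(BaseEstimator, TransformerMixin):
    """Identity transformer: passes the data through unchanged so that
    informative "missing" codes such as -7/-8 survive for the later
    missing-indicator step instead of being overwritten."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# The special codes are left exactly as they are
X = np.array([[1.0, -7.0], [2.0, -8.0]])
print(DoNothingImputer().fit_transform(X))
```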

OneHot encoder

As discussed earlier, we now encoded the categorical features using OneHotEncoder.

Missing indicators

As discussed earlier, we then dealt with the special encodings. We added dummy features for all -7 and -8 values.
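One way such dummy features could be generated is sketched below; the helper name `add_special_value_flags` is our illustrative choice, and the `minus7…`/`minus8…` naming mirrors the column names produced by our pipeline:

```python
import pandas as pd

def add_special_value_flags(df, codes=(-7, -8)):
    """For every column containing a special code, append a dummy column
    flagging the rows where that code occurs."""
    out = df.copy()
    for code in codes:
        for col in df.columns:
            if (df[col] == code).any():
                out[f"minus{abs(code)}{col}"] = (df[col] == code).astype(int)
    return out

demo = pd.DataFrame({"MSinceMostRecentDelq": [3, -7, 12]})
flagged = add_special_value_flags(demo)
print(list(flagged.columns))  # ['MSinceMostRecentDelq', 'minus7MSinceMostRecentDelq']
```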

Scaler

We standard-scaled the columns that require scaling (all the non-one-hot columns). We did this with another ColumnTransformer object.
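This step might look roughly as follows, with the column lists reduced to two for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"ExternalRiskEstimate": [55, 70, 85],
                  "MaxDelqEver_8": [1, 0, 1]})  # already one-hot, leave as-is
numeric_cols = ["ExternalRiskEstimate"]
scale_ct = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough")  # one-hot columns pass through unscaled
X_sc = scale_ct.fit_transform(X)
print(X_sc.shape)  # (3, 2)
```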

Pipelines

We put all the column transformers and feature expansions into a pipeline for smooth preprocessing. We fit the pipeline to the training data and transformed it accordingly. Then we transformed the test data with this pipeline.
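A condensed sketch of this fit-on-train / transform-on-test pattern, using a toy two-column frame in place of the real HELOC features:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ["MaxDelqEver"]
num_cols = ["ExternalRiskEstimate"]

preprocess = Pipeline([
    ("columns", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("scale", StandardScaler(), num_cols),
    ], remainder="passthrough")),
])

X_train = pd.DataFrame({"MaxDelqEver": [2, 8], "ExternalRiskEstimate": [60, 80]})
X_test = pd.DataFrame({"MaxDelqEver": [8], "ExternalRiskEstimate": [70]})
# Fit on the training data only, then apply the same transform to test data.
Xtr = preprocess.fit_transform(X_train)
Xte = preprocess.transform(X_test)
```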

After preprocessing the data, these are the column names we obtained.

train_data_t_sc.columns
Index(['RiskPerformance', 'MaxDelq2PublicRecLast12M_0',
       'MaxDelq2PublicRecLast12M_1', 'MaxDelq2PublicRecLast12M_2',
       'MaxDelq2PublicRecLast12M_3', 'MaxDelq2PublicRecLast12M_4',
       'MaxDelq2PublicRecLast12M_5', 'MaxDelq2PublicRecLast12M_6',
       'MaxDelq2PublicRecLast12M_7', 'MaxDelq2PublicRecLast12M_9',
       'MaxDelqEver_2', 'MaxDelqEver_3', 'MaxDelqEver_4', 'MaxDelqEver_5',
       'MaxDelqEver_6', 'MaxDelqEver_7', 'MaxDelqEver_8',
       'ExternalRiskEstimate', 'MSinceOldestTradeOpen',
       'MSinceMostRecentTradeOpen', 'AverageMInFile', 'NumSatisfactoryTrades',
       'NumTrades60Ever2DerogPubRec', 'NumTrades90Ever2DerogPubRec',
       'PercentTradesNeverDelq', 'MSinceMostRecentDelq', 'NumTotalTrades',
       'NumTradesOpeninLast12M', 'PercentInstallTrades',
       'MSinceMostRecentInqexcl7days', 'NumInqLast6M', 'NumInqLast6Mexcl7days',
       'NetFractionRevolvingBurden', 'NetFractionInstallBurden',
       'NumRevolvingTradesWBalance', 'NumInstallTradesWBalance',
       'NumBank2NatlTradesWHighUtilization', 'PercentTradesWBalance',
       'minus7MSinceMostRecentDelq', 'minus7MSinceMostRecentInqexcl7days',
       'minus8MSinceOldestTradeOpen', 'minus8MSinceMostRecentDelq',
       'minus8MSinceMostRecentInqexcl7days',
       'minus8NetFractionRevolvingBurden', 'minus8NetFractionInstallBurden',
       'minus8NumRevolvingTradesWBalance', 'minus8NumInstallTradesWBalance',
       'minus8NumBank2NatlTradesWHighUtilization',
       'minus8PercentTradesWBalance'],
      dtype='object')
Model training

Since interpretability is paramount in credit risk, we focused on simpler, easier-to-explain models. We therefore did not consider ensemble models (forests, bagging, boosting), even though they may achieve higher accuracy.

The predictive model we chose is the logistic regression classifier. It is a straightforward, interpretable model that estimates the probability of the target variable, RiskPerformance, taking a certain value ("Good" or "Bad") based on the input features, which makes the relationship between the inputs and RiskPerformance easy to understand. Its simplicity also lets us train it faster and more easily than more complex models while still achieving reasonably good predictive performance. Finally, logistic regression is well suited to binary outcome prediction, and since RiskPerformance is binary, it is a natural choice.

Furthermore, to confirm that logistic regression was the better choice, we ran other models and compared performance metrics such as accuracy. Our model's accuracy was 71.65%, noticeably higher than the others: the Support Vector Machine (SVM) model reached an accuracy of 50%, while the Support Vector Classification (SVC) model reached 57%.

Choice of performance metric

We chose the true positive rate (TPR) over accuracy as our primary metric for this project.

Our TPR, or recall, was 78.18%, the highest among the models we compared. The TPR matters here because it tells a bank or credit company how well its default-prediction model identifies consumers who are unlikely to repay. A high TPR means the model correctly captures a larger proportion of bad-risk customers, which is critical in credit risk.
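The TPR can be computed from the confusion matrix; a small sketch with hypothetical labels, where 1 stands in for the "Bad" class we most want to catch:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = "Bad" (defaults), the class we most want to catch.
y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # TPR = 2/3: two of three bads caught
print(recall_score(y_true, y_pred))  # same value via sklearn
```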
Logistic regression classifier

First we will try out the logistic regression classifier. We will work with linear models because of their interpretability.
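A minimal sketch of fitting such a classifier, using synthetic data in place of the preprocessed HELOC features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed HELOC features.
X, y = make_classification(n_samples=500, n_features=10, random_state=1234)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
print(clf.coef_.shape)            # one coefficient per feature
```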
Let us look at the coefficients of the regression
Some of these coefficients are equal to 0, but there is something interesting about the categorical variables derived from months since last inquiry (excluding the last 7 days), especially when the value is 8.
Upon further investigation, it does not seem that this feature would help much in classifying risk performance, as the distributions of risk performance do not differ significantly across its categories.
SVM Classifier


Decision tree

Accuracy of the Decision tree stump classifier: 0.6955066921606119

This tree stump performs worse than the logistic regression model, but it could serve as a building block for an ensemble.
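For reference, a stump of this kind is simply a depth-one decision tree; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; a "stump" makes a single split on one feature.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(stump.get_depth())  # 1
```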


The recall performance of this model is relatively good considering that it is a single tree.

The only problem is interpretability in deployment: while this model performs rather well (high recall), we cannot choose it because it would not meet the interpretability expectations in practice. Thus, we will go with the relatively simpler but comparable model: the logistic regression classifier.


Conclusion

We chose the logistic regression classifier for this project because of its easy interpretability and decent recall performance. The coefficients (\(\beta\)) can be inspected in the code file accompanying this report.

However, it has some limitations still, which we will discuss now.

Limitations

While we achieved decent accuracy with our model, we recognize some limitations that may affect its predictions. The logistic regression classifier has limited flexibility, i.e., it may not capture complex relationships between the variables that could influence whether a consumer defaults on a loan payment.

Furthermore, a higher accuracy would make the model more effective at identifying likely loan defaulters. Lastly, the model is not as fully interpretable as desired, given its difficulty with complex relationships between variables and its assumptions, such as linearity.


lr_grid_search.best_estimator_
LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
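The estimator above was selected by a hyperparameter grid search; a minimal sketch of such a search, using synthetic data and an assumed parameter grid, scored by recall to match the metric discussed earlier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the grid values here are illustrative.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
# liblinear supports both L1 and L2 penalties for logistic regression.
lr_grid_search = GridSearchCV(LogisticRegression(solver="liblinear"),
                              param_grid, scoring="recall", cv=5)
lr_grid_search.fit(X, y)
print(lr_grid_search.best_estimator_)
```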